Dual-Modal Transformer with Enhanced Inter- and Intra-Modality Interactions for Image Captioning

Abstract

Image captioning aims to describe an image with the words that best convey the semantic meaning of the scene it depicts. Different models can be used to accomplish this arduous task, depending on the context and on what needs to be achieved. An encoder–decoder model that takes image feature vectors as input to the encoder is often regarded as one appropriate approach to the captioning process. In the proposed work, a dual-modal transformer is used that captures intra- and inter-modal interactions simultaneously within an attention block. The architecture is quantitatively evaluated on the publicly available Microsoft Common Objects in Context (MS COCO) dataset, yielding a Bilingual Evaluation Understudy (BLEU)-4 score of 85.01. The efficacy of the model is evaluated on the Flickr 8k, Flickr 30k, and MS COCO datasets, and the results are compared and analysed against state-of-the-art methods. The results show that the proposed model outperforms conventional models, such as the encoder–decoder model.
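The idea of capturing intra- and inter-modal interactions in a single attention block can be illustrated with a minimal sketch. This is not the authors' architecture: it assumes plain scaled dot-product self-attention over the concatenation of image and word tokens, with no learned projections, and the token values and function names (`joint_attention`, `split_attention_mass`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(image_tokens, word_tokens):
    """Self-attention over the concatenation of both modalities.

    Assumption: queries, keys and values are the raw tokens themselves
    (no learned projection matrices), purely for illustration.
    """
    tokens = image_tokens + word_tokens
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


def split_attention_mass(weights, n_image):
    """For each query, return (mass on image tokens, mass on word tokens)."""
    return [(sum(row[:n_image]), sum(row[n_image:])) for row in weights]


# Hypothetical inputs: two image-region features and three word embeddings.
image_tokens = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
word_tokens = [[0.8, 0.1, 0.4], [0.0, 1.0, 0.0], [0.3, 0.3, 0.3]]

outputs, weights = joint_attention(image_tokens, word_tokens)
masses = split_attention_mass(weights, n_image=len(image_tokens))
```

For an image-token query, the first entry of its pair in `masses` is the intra-modal share of attention and the second is the inter-modal share; for a word-token query the roles are reversed. Because every query attends over both groups at once, both kinds of interaction are computed in the same block.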

Similar resources

Image Captioning with Attention

In the past few years, neural networks have fueled dramatic advances in image classification. Emboldened, researchers are looking for more challenging applications for computer vision and artificial intelligence systems. They seek not only to assign numerical labels to input data, but to describe the world in human terms. Image and video captioning is among the most popular applications in this t...

End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevent...

Inter- and Intra-Domain Routing Interactions for MANETs

When making use of inter-domain routing in a MANET environment, certain interactions are required between the Exterior Gateway Protocol (EGP) and the Interior Gateway Protocol (IGP, such as OLSR or AODV). Unlike the norm in conventional fixed networks, many MANET protocols assume that every node in a network is a router, and as such a different mechanism to traditional dual-IGP/EGP functionalit...

Image Captioning with Sparse LSTM

Long Short-Term Memory (LSTM) is widely used to solve sequence modeling problems, for example, image captioning. We found the LSTM cells are heavily redundant. We adopt network pruning to reduce the redundancy of LSTM and introduce sparsity as new regularization to reduce overfitting. We can achieve better performance than the dense baseline while reducing the total number of parameters in LSTM...

Contrastive Learning for Image Captioning

Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...

Journal

Journal title: Applied Sciences

Year: 2022

ISSN: 2076-3417

DOI: https://doi.org/10.3390/app12136733